First catch your corpus: methodological challenges in constructing a thematic corpus
Similar resources
Automatically Constructing a Corpus of Sentential Paraphrases
An obstacle to research in automatic paraphrase identification and generation is the lack of large-scale, publicly available labeled corpora of sentential paraphrases. This paper describes the creation of the recently-released Microsoft Research Paraphrase Corpus, which contains 5801 sentence pairs, each hand-labeled with a binary judgment as to whether the pair constitutes a paraphrase. The cor...
Getting to Know Your Corpus
Corpora are not easy to get a handle on. The usual way of getting to grips with text is to read it, but corpora are mostly too big to read (and not designed to be read). We show, with examples, how keyword lists (of one corpus vs. another) are a direct, practical and fascinating way to explore the characteristics of corpora, and of text types. Our method is to classify the top one hundred keywo...
Thematic Analysis and Visualization of Textual Corpus
The semantic analysis of documents is currently a domain of intense research. Work in this domain can take several directions and address several levels of granularity. In the present work we are specifically interested in the thematic analysis of textual documents. In our approach, we suggest studying the variation of the theme relevance within a text to identify the major theme and all the...
Multilingwis – Explore Your Parallel Corpus
We present Multilingwis2, a web-based search engine for exploration of word-aligned parallel and multiparallel corpora. Our application extends the search facilities by Clematide et al. (2016) and is designed to be easily employable on any parallel corpus comprising universal part-of-speech tags, lemmas and word alignments. In addition to corpus exploration, it has proven useful for the assessm...
Towards a ‘Science’ of Corpus Annotation: A New Methodological Challenge for Corpus Linguistics
Corpus annotation—adding interpretive information into a collection of texts—is valuable for a number of reasons, including the validation of theories of textual phenomena and the creation of corpora upon which automated learning algorithms can be trained. This paper outlines the main challenges posed by human-coded corpus annotation for current corpus linguistic practice, describing some of th...
Journal
Journal title: Corpora
Year: 2018
ISSN: 1749-5032,1755-1676
DOI: 10.3366/cor.2018.0145